Extraction of Translation Equivalents from Parallel Corpora
Abstract
In the past, much effort was devoted to the compilation of multilingual parallel corpora for the purpose of linguistic information retrieval. This paper aims to introduce and evaluate three simple strategies for the extraction of translation equivalents from structured parallel texts. The goal is to support the production of bilingual dictionaries for domain-specific applications. The approaches described in the paper assume sentence alignment, strict translations, and historical relations between the considered language pairs. They take advantage of corpus characteristics like short aligned units and structural and orthographic similarities in order to obtain results with a high level of precision. Furthermore, it will be shown that automatic filtering can be used to improve the precision of the extracted material. Simple techniques are used to detect translation candidates that are most likely wrong.
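The abstract does not spell out the three strategies, so the following is only a minimal sketch of how orthographic similarity within sentence-aligned units might be used to propose translation candidates, followed by a simple filter for unlikely pairs. The similarity measure (`difflib.SequenceMatcher`), the function names, the thresholds, and the length-ratio filter are all assumptions made for illustration, not the paper's method.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Orthographic similarity in [0, 1] from matching character blocks (assumed measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def extract_candidates(aligned_pairs, threshold=0.7, min_len=4):
    """Propose (source, target) word pairs from aligned units.

    For each sufficiently long source word, pick the most similar target
    word in the same aligned unit and keep it if the score reaches the
    (hypothetical) threshold. Returns a dict mapping pairs to scores.
    """
    candidates = {}
    for src_unit, tgt_unit in aligned_pairs:
        tgt_words = tgt_unit.split()
        for s in src_unit.split():
            if len(s) < min_len:
                continue  # very short words match too easily by chance
            best = max(tgt_words, key=lambda t: similarity(s, t), default=None)
            if best is None:
                continue  # empty target unit
            score = similarity(s, best)
            if score >= threshold:
                key = (s.lower(), best.lower())
                candidates[key] = max(score, candidates.get(key, 0.0))
    return candidates


def filter_candidates(candidates, min_len_ratio=0.5):
    """Drop pairs whose word lengths differ too much (hypothetical filter)."""
    kept = {}
    for (s, t), score in candidates.items():
        ratio = min(len(s), len(t)) / max(len(s), len(t))
        if ratio >= min_len_ratio:
            kept[(s, t)] = score
    return kept


# Usage with a made-up English-Swedish aligned unit.
pairs = [("the international organisation", "den internationella organisationen")]
found = filter_candidates(extract_candidates(pairs))
for (src, tgt), score in sorted(found.items()):
    print(f"{src} -> {tgt} ({score:.2f})")
```

A string-similarity approach of this kind only makes sense under the assumptions the abstract names: short aligned units keep the candidate set small, and historically related languages share enough cognates for spelling to be informative.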
Similar resources
Extracting a Parallel Corpus from Comparable Documents to Improve Translation Quality in Machine Translation Systems
Data used for training statistical machine translation methods are usually prepared from three resources: parallel, non-parallel and comparable text corpora. Parallel corpora are an ideal resource for translation, but due to the lack of such texts, non-parallel and comparable corpora are also used, for parallel text extraction. Most of existing methods for exploiting comparable corpora loo...
Measuring Comparability of Documents in Non-Parallel Corpora for Efficient Extraction of (Semi-)Parallel Translation Equivalents
In this paper we present and evaluate three approaches to measure comparability of documents in non-parallel corpora. We develop a task-oriented definition of comparability, based on the performance of automatic extraction of translation equivalents from the documents aligned by the proposed metrics, which formalises intuitive definitions of comparability for machine translation research. We de...
Extraction of Translation Equivalents from Parallel Corpora Using Sense-sensitive Contexts
The paper proposes an unsupervised method to extract translation equivalents from parallel corpora. The strategy we use takes into account the context of words. Given a word of the source language and a particular context, we learn its word translation within an equivalent context. We first extract pairs of similar contexts and, then, we compare the similarity between words appearing in each pa...
Extraction of Translation Equivalents from Non-Parallel Corpora
This paper presents a widely applicable method for extracting bilingual expressions from non-parallel corpora. The algorithm first collects word sequences as candidates for translation equivalents that match given patterns of word sequences from each corpus. Then, translation equivalents are selected from these candidates by aligning component words from within word sequences. We show the resul...
Learning Spanish-Galician Translation Equivalents Using a Comparable Corpus and a Bilingual Dictionary
So far, research on extraction of translation equivalents from comparable, non-parallel corpora has not been very popular. The main reason was the poor results when compared to those obtained from aligned parallel corpora. The method proposed in this paper, relying on seed patterns generated from external bilingual dictionaries, allows us to achieve similar results to those from parallel corpus...
Automatic Extraction of Translation Equivalents From Parallel Corpora
This paper presents a simple and effective method for extraction of translation equivalents from parallel corpora. Experiments were conducted on Orwell's "1984" parallel corpus with translations available in six CEE languages, all of them being aligned to the English original. Six bilingual lexicons X-English (En) were extracted, where X stands for one of Czech (Cz), Bulgarian (Bg), Eston...